fix(releases): Prevent row-lock contention on last_seen bump #115443

Merged
yuvmen merged 7 commits into master from yuvmen/fix-rpe-last-seen-contention
May 13, 2026

Member

@yuvmen yuvmen commented May 12, 2026

Summary

  • High-volume releases (e.g. popular mobile apps) cause concurrent ingest workers to pile up trying to UPDATE last_seen on the same ReleaseProjectEnvironment/ReleaseEnvironment row, triggering PostgreSQL statement_timeout cancellations (SENTRY-5HQZ — 6105 occurrences).
  • The unhandled OperationalError aborts the rest of the event save pipeline (nodestore persistence, release counts, group release records), causing silent data loss.
  • Adds a cache.add-based distributed lock so only one worker per row per 60s attempts the DB update, eliminating the thundering herd. Also catches OperationalError as a safety net so a failed best-effort last_seen bump never blocks event processing.
  • Applied the same fix to both ReleaseProjectEnvironment and ReleaseEnvironment which had identical vulnerable patterns.
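The throttle-plus-safety-net idea in the bullets above can be sketched without Django. This is only an illustration under stated assumptions: `FakeCache` is a hypothetical in-memory stand-in whose `add` sets a key only if it is absent and reports whether it did (the semantics the summary relies on), and `bump_last_seen`, `do_update`, and the local `OperationalError` class are illustrative names, not the code from this PR. Unlike a shared cache backend, the stand-in is not safe across processes.

```python
import time
from typing import Callable, Dict, Tuple


class FakeCache:
    """Hypothetical in-memory stand-in for a cache with add-if-absent semantics."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}

    def add(self, key: str, value: str, timeout: float) -> bool:
        now = time.monotonic()
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return False  # key is still live: another worker holds the lock
        self._data[key] = (value, now + timeout)
        return True


class OperationalError(Exception):
    """Stand-in for the database layer's OperationalError."""


def bump_last_seen(cache: FakeCache, bump_key: str, do_update: Callable[[], None]) -> str:
    """Run do_update at most once per 60s per key and never raise on a DB failure.

    Returns the value that would be recorded in the "bumped" metric tag.
    """
    if not cache.add(bump_key, "1", timeout=60):
        return "skipped"
    try:
        do_update()
    except OperationalError:
        return "error"  # best effort: the caller keeps processing the event
    return "true"
```

With this shape, the first caller for a given key performs the update and every other caller within the next 60 seconds returns immediately, so only one worker per window waits on the row lock.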

Fixes SENTRY-5HQZ

Test plan

  • Existing tests pass (3 for RPE, 1 for RE)
  • New test: test_bump_skipped_when_cache_lock_held — verifies second worker skips the DB update when the cache lock is held
  • New test: test_bump_survives_operational_error — verifies OperationalError is caught and instance is returned
  • mypy and ruff pass

…aseProjectEnvironment and ReleaseEnvironment

High-volume releases cause concurrent workers to pile up on the same row's
UPDATE for last_seen, hitting statement_timeout (SENTRY-5HQZ). This adds a
cache-based distributed lock so only one worker per row per 60s attempts the
DB update, and catches OperationalError so a failed bump doesn't abort the
rest of the event save pipeline.

Fixes SENTRY-5HQZ
@yuvmen yuvmen requested a review from a team as a code owner May 12, 2026 22:07
@github-actions github-actions Bot added the "Scope: Backend" label (automatically applied to PRs that change backend components) May 12, 2026
Comment thread src/sentry/models/releaseenvironment.py Outdated
if cache.add(bump_key, "1", timeout=60):
    try:
        cls.objects.filter(
            id=instance.id, last_seen__lt=datetime - timedelta(seconds=60)
Member

This could probably be updated to last_seen__lt=datetime now, since the lock is doing the work for us

Member Author

ah true

if cache.add(bump_key, "1", timeout=60):
    try:
        cls.objects.filter(
            id=instance.id, last_seen__lt=datetime - timedelta(seconds=60)
Member

Same comment as above

yuvmen added 2 commits May 12, 2026 15:20
The cache lock handles the 60s throttle now, so the SQL filter only
needs to prevent setting last_seen backwards.
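The "never move last_seen backwards" guard in the commit message above can be seen in isolation with a conditional UPDATE. This is a minimal sketch using SQLite from the standard library rather than PostgreSQL and the Django ORM; the table name `rpe`, the integer timestamps, and the `bump` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rpe (id INTEGER PRIMARY KEY, last_seen INTEGER)")
conn.execute("INSERT INTO rpe (id, last_seen) VALUES (1, 1000)")


def bump(conn: sqlite3.Connection, row_id: int, seen_at: int) -> int:
    """Set last_seen only if seen_at is newer; return the number of rows changed."""
    cur = conn.execute(
        "UPDATE rpe SET last_seen = ? WHERE id = ? AND last_seen < ?",
        (seen_at, row_id, seen_at),
    )
    return cur.rowcount


newer = bump(conn, 1, 1060)  # newer than the stored value: the row is updated
older = bump(conn, 1, 1030)  # late-arriving older event: the WHERE clause matches nothing
stored = conn.execute("SELECT last_seen FROM rpe WHERE id = 1").fetchone()[0]
```

The `last_seen < ?` condition plays the role of `last_seen__lt=datetime` in the filter: it no longer throttles anything, it only keeps an out-of-order event from overwriting a newer timestamp.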
Comment thread src/sentry/models/releaseenvironment.py Outdated
Comment on lines +69 to +81
if cache.add(bump_key, "1", timeout=60):
    try:
        cls.objects.filter(id=instance.id, last_seen__lt=datetime).update(
            last_seen=datetime
        )
    except OperationalError:
        metric_tags["bumped"] = "error"
        return instance
    instance.last_seen = datetime
    cache.set(cache_key, instance, 3600)
    metric_tags["bumped"] = "true"
else:
    metric_tags["bumped"] = "skipped"
Member

This is a use-case that our buffers system can handle. We currently use buffers to handle updating last_seen times on Group and Release models. It could be used here as well. However, what you have is operationally lighter and if we don't need to capture the tail of each update set it will work well.

Member Author

oh I didn't even consider that, but yeah, I think pragmatically it's better to just keep the simple approach here, since this is much less critical to get absolutely correct
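One way to read the "tail of each update set" trade-off in this thread is with a toy model. The sketch below is not Sentry's buffers implementation; both functions are hypothetical simplifications that treat each event's timestamp as its arrival time. The throttled version keeps the first event in each 60s window, while a coalescing buffer would keep the latest.

```python
from typing import Iterable, List, Optional


def throttled_last_seen(event_times: Iterable[int], window: int = 60) -> Optional[int]:
    """Toy model of the lock approach: the first event per window writes, the rest are skipped."""
    stored: Optional[int] = None
    lock_expires = float("-inf")
    for t in event_times:
        if t >= lock_expires:  # lock is free, so this event wins it
            lock_expires = t + window
            if stored is None or stored < t:
                stored = t
    return stored


def buffered_last_seen(event_times: Iterable[int]) -> Optional[int]:
    """Toy model of a coalescing buffer: every event folds into a pending max, flushed later."""
    pending: Optional[int] = None
    for t in event_times:
        pending = t if pending is None else max(pending, t)
    return pending


events: List[int] = [0, 10, 59, 61, 100]
throttled = throttled_last_seen(events)  # last write happened at t=61
buffered = buffered_last_seen(events)    # includes the tail event at t=100
```

In this model the throttled value can trail the true latest event by up to one window, which is the imprecision the thread accepts in exchange for the lighter mechanism.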

Comment on lines +91 to +103
if cache.add(bump_key, "1", timeout=60):
    try:
        cls.objects.filter(id=instance.id, last_seen__lt=datetime).update(
            last_seen=datetime
        )
    except OperationalError:
        metrics_tags["bumped"] = "error"
        return instance
    instance.last_seen = datetime
    cache.set(cache_key, instance, 3600)
    metrics_tags["bumped"] = "true"
else:
    metrics_tags["bumped"] = "skipped"
Member

With two of these (and potentially more) should we have a context manager to encapsulate this behavior?

with periodic_update(key=bump_key, timeout=60, metrics_tags=metrics_tags):
    cls.objects.filter(id=instance.id, last_seen__lt=datetime).update(last_seen=datetime)

Member Author

yeah I wondered about the duplication, that's a nice wrapper though, I'll try and see if I can reduce it to something clean, since it has an early return there and a cache update
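A wrinkle with the context-manager idea is that a Python `with` block always runs its body, so the manager cannot skip the update by itself. One possible shape, sketched here under assumptions, is to yield a flag the caller checks. The module-level `_held` set is a hypothetical stand-in for the `cache.add` lock (so the timeout is omitted), the local `OperationalError` class stands in for the database exception, and this is not the helper that was eventually extracted.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, Set


class OperationalError(Exception):
    """Stand-in for the database layer's OperationalError."""


_held: Set[str] = set()  # hypothetical stand-in for cache.add-based locking


@contextmanager
def periodic_update(key: str, metrics_tags: Dict[str, str]) -> Iterator[bool]:
    """Yield True if this caller won the lock and should perform the update."""
    if key in _held:
        metrics_tags["bumped"] = "skipped"
        yield False
        return
    _held.add(key)
    try:
        yield True
    except OperationalError:
        metrics_tags["bumped"] = "error"  # swallowed: the bump is best effort
    else:
        metrics_tags["bumped"] = "true"


tags: Dict[str, str] = {}
with periodic_update("rpe:1", tags) as should_bump:
    if should_bump:
        pass  # the UPDATE and the cache refresh would go here
```

The caller still needs the `if should_bump:` check, and any follow-up work such as refreshing the cached instance has to live inside it, which is roughly the early-return complication mentioned in the reply above.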

The throttled last_seen bump pattern was duplicated across
ReleaseEnvironment and ReleaseProjectEnvironment. Extract it
into sentry/utils/last_seen.py so both models share one implementation.
Comment thread src/sentry/models/releaseenvironment.py
yuvmen added 2 commits May 13, 2026 10:53
Model doesn't have a last_seen attribute as far as mypy knows.
Using Any avoids the attr-defined errors without needing a Protocol.
…ted rows

Use a HasLastSeen Protocol to type the instance parameter instead of
Any. Keep model_class as Any since typing the Django manager chain
is impractical. Also restore the bumped=false metric tag when a row
is newly created, which was dropped during the extraction.
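For readers unfamiliar with the typing approach named in this commit message, a structural Protocol lets a helper accept any object that has the right attributes without importing the concrete model. The sketch below is an assumption about the general shape only: the attribute list, `FakeRow`, and `set_last_seen` are illustrative and not taken from `sentry/utils/last_seen.py`.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Protocol


class HasLastSeen(Protocol):
    """Structural type: anything exposing id and last_seen attributes."""

    id: int
    last_seen: datetime


@dataclass
class FakeRow:
    """Hypothetical stand-in for a model instance."""

    id: int
    last_seen: datetime


def set_last_seen(instance: HasLastSeen, seen_at: datetime) -> None:
    """Update the in-memory copy, never moving last_seen backwards."""
    if instance.last_seen < seen_at:
        instance.last_seen = seen_at


row = FakeRow(id=1, last_seen=datetime(2026, 5, 12, tzinfo=timezone.utc))
set_last_seen(row, datetime(2026, 5, 13, tzinfo=timezone.utc))
```

Because the check is structural, a type checker accepts `FakeRow` here without it inheriting from `HasLastSeen`, which avoids the attr-defined errors mentioned in the earlier commit without falling back to `Any`.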
@yuvmen yuvmen enabled auto-merge (squash) May 13, 2026 19:07
@yuvmen yuvmen merged commit c1d42ce into master May 13, 2026
71 checks passed
@yuvmen yuvmen deleted the yuvmen/fix-rpe-last-seen-contention branch May 13, 2026 19:07
Labels

Scope: Backend Automatically applied to PRs that change backend components

3 participants